The Annals of Statistics 

2009, Vol. 37, No. 5B, 2922-2952 

DOI: 10.1214/08-AOS665 

© Institute of Mathematical Statistics, 2009 



PROPERTIES AND REFINEMENTS OF THE FUSED LASSO 

By Alessandro Rinaldo 1 

Carnegie Mellon University 

We consider estimating an unknown signal, both blocky and sparse, which is
corrupted by additive noise. We study three interrelated least squares
procedures and their asymptotic properties. The first procedure is the fused
lasso, put forward by Friedman et al. [Ann. Appl. Statist. 1 (2007) 302-332],
which we modify into a different estimator, called the fused adaptive lasso,
with better properties. The other two estimators we discuss solve least
squares problems on sieves; one constrains the maximal $\ell_1$ norm and the
maximal total variation seminorm, and the other restricts the number of blocks
and the number of nonzero coordinates of the signal. We derive conditions for
the recovery of the true block partition and the true sparsity patterns by the
fused lasso and the fused adaptive lasso, and we derive convergence rates for
the sieve estimators, explicitly in terms of the constraining parameters.

1. Introduction. We consider the nonparametric regression model

$y_i = \mu_i^0 + \varepsilon_i, \qquad i = 1,\ldots,n,$

where $\mu^0 \in \mathbb{R}^n$ is the unknown vector of mean values to be
estimated using the observations $y_i$, and the errors $\varepsilon_i$ are
assumed to be independent with either Gaussian or sub-Gaussian distributions
and bounded variances. We are concerned with the more specialized settings
where $\mu^0$ can be both sparse, with a possibly very large number of zero
entries, and blocky, meaning that the number of coordinates where $\mu^0$
changes its values can be much smaller than n. Figure 1 shows an instance of
data generated by corrupting a blocky and sparse signal with additive noise
(see Section 2.4 for details about this example). Formally, we assume that
there exists a partition $\{B_1^0, \ldots, B_{J_0}^0\}$ of $\{1,\ldots,n\}$
into sets of consecutive indexes, from now on called a block partition, and a
vector $\nu^0 \in \mathbb{R}^{J_0}$, which may be sparse, such that the true
mean vector can be written as

(1.1)  $\mu^0 = \sum_{j=1}^{J_0} \nu_j^0 1_{B_j^0},$

where $1_B$ is the indicator function of the set $B \subseteq \{1,\ldots,n\}$
(i.e., the n-dimensional vector whose ith coordinate is 1 if $i \in B$ and 0
otherwise). The partition $\{B_1^0, \ldots, B_{J_0}^0\}$, its size $J_0$, the
vector $\nu^0$ of block values and its zero coordinates are all unknown, and
our goal is to produce estimates of those or related quantities that are
accurate when n is large enough.

Fig. 1. Signal (solid line) plus noise for the example described in Section 2.4.

In particular, we investigate the asymptotic properties of three different
but interrelated methods for the recovery of the unknown mean vector $\mu^0$
under the assumption (1.1).

The first methodology we study, which is the central focus of this work,
is the fused lasso procedure of Friedman et al. (2007). The fused lasso is the
penalized least squares estimator

(1.2)  $\hat\mu^{FL} = \arg\min_{\mu \in \mathbb{R}^n} \left\{ \sum_{i=1}^n (y_i - \mu_i)^2 + 2\lambda_{1,n}\|\mu\|_1 + 2\lambda_{2,n}\|\mu\|_{TV} \right\},$

where $\|\mu\|_1 = \sum_{i=1}^n |\mu_i|$ is the $\ell_1$ norm and
$\|\mu\|_{TV} = \sum_{i=2}^n |\mu_i - \mu_{i-1}|$ the total variation seminorm
of $\mu$, respectively, and $(\lambda_{1,n}, \lambda_{2,n})$ are positive
tuning parameters to be chosen appropriately. The solution to the convex
program (1.2) can be computed in a fast and efficient way using the algorithm
developed in Friedman et al. (2007), where the properties of the fused lasso
solution are considered from the optimization theory standpoint. Our analysis
has led us to propose a modified version of the fused lasso, which we call the
fused adaptive lasso, that has improved properties. Figure 2 shows an example
of a fused adaptive lasso fit to the data displayed in Figure 1.

Fig. 2. A fusion adaptive lasso estimate for the example from Section 2.4,
using the most biased fusion estimator shown in Figure 3 and the oracle
threshold for the lasso penalty, as described in Section 2.3.
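To make the objective in (1.2) concrete, the following minimal sketch evaluates it for a candidate vector. The function names (`l1_norm`, `tv_seminorm`, `fused_lasso_objective`) are illustrative choices made here, not part of any library or of the original article, and the code only evaluates the criterion; it does not minimize it.

```python
def l1_norm(mu):
    """Sum of absolute values of the coordinates of mu."""
    return sum(abs(m) for m in mu)


def tv_seminorm(mu):
    """Total variation: sum of absolute differences of consecutive coordinates."""
    return sum(abs(mu[i] - mu[i - 1]) for i in range(1, len(mu)))


def fused_lasso_objective(y, mu, lam1, lam2):
    """Value of the criterion in (1.2) at mu, for data y and penalties lam1, lam2."""
    rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return rss + 2 * lam1 * l1_norm(mu) + 2 * lam2 * tv_seminorm(mu)


if __name__ == "__main__":
    # Residual sum of squares 0.06, l1 norm 1.0, total variation 1.4.
    print(fused_lasso_objective([0, 1, 0], [0.1, 0.8, 0.1], lam1=0.05, lam2=0.1))
```

A constant vector has zero total variation, so only the $\ell_1$ term penalizes it; a vector of zeros incurs no penalty at all.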

In our second approach, we turn to a different convex optimization program,
namely

(1.3)  $\arg\min_{\mu \in \mathbb{R}^n} \sum_{i=1}^n (y_i - \mu_i)^2 \quad \text{s.t.} \quad \|\mu\|_1 \le L_n, \ \|\mu\|_{TV} \le T_n,$

for some nonnegative constants $L_n$ and $T_n$. Notice that, in this
alternative formulation, which is akin to the least squares method on sieves,
a solution different from y is obtained provided that $\|y\|_1 > L_n$ or
$\|y\|_{TV} > T_n$. The link with the fused lasso estimator is clear. The
objective function in the fused lasso problem (1.2) is the Lagrangian function
of (1.3), and, in fact, the two problems are equivalent from the point of view
of convexity theory.

Our third and final method for the recovery of a sparse and blocky signal is
also related to sieve least squares procedures, and is more naturally tailored
to the model assumption (1.1). Specifically, we study the solution to the
highly nonconvex optimization problem

(1.4)  $\arg\min_{\mu \in \mathbb{R}^n} \sum_{i=1}^n (y_i - \mu_i)^2 \quad \text{s.t.} \quad |\{i : \mu_i \ne 0\}| \le S_n, \ 1 + |\{i : \mu_i - \mu_{i-1} \ne 0, \ 2 \le i \le n\}| \le J_n,$

where $S_n$ and $J_n$ are nonnegative constants. Although lack of convexity
makes this problem computationally difficult when n is large, the theoretical
relevance of this third formulation stems from the fact that (1.3) is,
effectively, a convex relaxation of (1.4).
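The two constraint sets can be compared side by side with a short sketch. The helper names `in_convex_sieve` and `in_block_sieve` are hypothetical and only test whether a given vector is feasible for (1.3) or (1.4); they do not solve either problem.

```python
def in_convex_sieve(mu, L, T):
    """Feasibility for (1.3): l1 norm at most L and total variation at most T."""
    l1 = sum(abs(m) for m in mu)
    tv = sum(abs(mu[i] - mu[i - 1]) for i in range(1, len(mu)))
    return l1 <= L and tv <= T


def in_block_sieve(mu, S, J):
    """Feasibility for (1.4): at most S nonzero coordinates and at most J blocks."""
    nonzeros = sum(1 for m in mu if m != 0)
    blocks = 1 + sum(1 for i in range(1, len(mu)) if mu[i] != mu[i - 1])
    return nonzeros <= S and blocks <= J


if __name__ == "__main__":
    mu = [0, 0, 2, 2, 0]  # two nonzero coordinates, three blocks
    print(in_convex_sieve(mu, L=4, T=4), in_block_sieve(mu, S=2, J=3))
```

The constraints in (1.4) count nonzero coordinates and jumps, whereas those in (1.3) sum their magnitudes, which is what makes the latter set convex.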

Our approach to the study of the estimators defined by (1.2), (1.3) and
(1.4) is asymptotic, as we allow the block representation for the unobserved
signal $\mu^0$ to change with n in such a way that the recovery of a noisy
signal under the model (1.1) may become increasingly difficult. Despite being
quite closely related as optimization problems, from an inferential
perspective, the three procedures under investigation each shed some light on
different and, in some way, complementary aspects of this problem.

Overall, our analysis yields conditions for consistency of the block partition
and block sparsity estimates by model (1.2) and its variant described in
Section 2.3, and explicit rates of consistency of both sieve solutions (1.3)
and (1.4). In essence, our results provide conditions for the sequences of
regularization parameters $\lambda_{1,n}$, $\lambda_{2,n}$, $L_n$, $J_n$ and
$S_n$ to guarantee various degrees of recovery of $\mu^0$.

The article is organized as follows. In Section 2, we study the fused lasso
estimator. After deriving an explicit formula for the fused lasso solution in
Section 2.1, we establish conditions under which the fused lasso procedure
is sparsistent, in the sense of being a weakly consistent estimator of both
the block partition and the set of nonzero coordinates of $\mu^0$. In
Section 2.3, we propose a simple modification of the fused lasso, which we
call the fused adaptive lasso, that achieves sparsistency under milder
conditions and also allows us to derive an oracle inequality for the empirical
risk. Finally, in Section 3, we derive consistency rates for the estimators
defined in (1.3) and (1.4), which depend explicitly on the parameters $L_n$
and $T_n$, and on $S_n$ and $J_n$, respectively. The proofs are relegated to
the Appendix.

We conclude this introductory section by fixing the notation that we will
be using throughout the article. For a vector $\mu \in \mathbb{R}^n$, we let
$S(\mu) = \{i : \mu_i \ne 0\}$ denote its support and
$\mathcal{J}(\mu) = \{i : \mu_i - \mu_{i-1} \ne 0, \ i \ge 2\}$ the set of
coordinates where $\mu$ changes its value. Furthermore, notice that we can
always write

$\mu = \sum_{j=1}^{J} \nu_j 1_{B_j}$

for some (possibly trivial) block partition $\{B_1, \ldots, B_J\}$, with
$1 \le J \le n$, and some vector $\nu \in \mathbb{R}^J$. Then, we will write
$\mathcal{JS}(\mu) = \{j : \nu_j \ne 0\}$ for the set of nonzero blocks of
$\mu$. On a final note, although all the quantities defined so far may change
with n, for ease of readability, we do not always make this dependence
explicit in our notation.
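The notation above translates directly into a few helper functions. This is a minimal sketch with illustrative names (`support`, `jumps`, `block_partition`, `nonzero_blocks`, `from_blocks`); it uses 0-based indices, whereas the text indexes coordinates from 1, and it assumes a nonempty input vector.

```python
def support(mu):
    """S(mu): indices of the nonzero coordinates (0-based)."""
    return [i for i, m in enumerate(mu) if m != 0]


def jumps(mu):
    """J(mu): indices i >= 1 (0-based) where mu[i] differs from mu[i - 1]."""
    return [i for i in range(1, len(mu)) if mu[i] != mu[i - 1]]


def block_partition(mu):
    """Blocks of consecutive indices on which mu is constant, and their values."""
    cuts = jumps(mu)
    starts, ends = [0] + cuts, cuts + [len(mu)]
    blocks = [list(range(s, e)) for s, e in zip(starts, ends)]
    values = [mu[b[0]] for b in blocks]
    return blocks, values


def nonzero_blocks(mu):
    """JS(mu): indices j of the blocks whose value is nonzero."""
    _, values = block_partition(mu)
    return [j for j, v in enumerate(values) if v != 0]


def from_blocks(blocks, values, n):
    """Rebuild mu as the sum over j of values[j] times the indicator of blocks[j]."""
    mu = [0] * n
    for block, value in zip(blocks, values):
        for i in block:
            mu[i] = value
    return mu


if __name__ == "__main__":
    mu = [0, 0, 2, 2, 0]
    blocks, values = block_partition(mu)
    print(support(mu), jumps(mu), blocks, values, nonzero_blocks(mu))
```

Rebuilding the vector with `from_blocks(blocks, values, len(mu))` returns the original `mu`, which is the block representation written out coordinate by coordinate.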

1.1. Previous works and comparison. The idea of using the total variation
seminorm in penalized least squares problems has been exploited and studied
in many applications (e.g., signal processing, parametric regression,
nonparametric regression and image denoising). From the algorithmic viewpoint,
this idea was originally brought up by Rudin, Osher and Fatemi (1992) [for
more recent developments, see, e.g., Dobson and Vogel (1997) and Caselles,
Chambolle and Novaga (2007), and also DeVore (1998)]. The original motivation
for this article was the recent work by Friedman et al. (2007), who devise
efficient coordinate-wise descent algorithms for a variety of convex problems.
In particular, they propose a novel approach based on a penalized least
squares problem using simultaneously the total variation and the $\ell_1$
penalties, which favors solutions that are both blocky and sparse. In the
classical nonparametric framework of function estimation, two important
contributions in the development and analysis of total variation-based
methods come from Mammen and van de Geer (1997) and Davies and Kovac (2001a).
Specifically, Mammen and van de Geer (1997) show that least squares splines
with adaptively chosen knots are solutions to nonparametric least squares
penalized regression problems with total variation penalties and derive,
among other things, consistency rates for both the one- and two-dimensional
case. Using a different approach, Davies and Kovac (2001a) devise a very
simple and effective procedure with $O(n)$ complexity, called the taut-string
algorithm, which effectively solves least squares problems with total
variation penalty. The taut-string can be used to consistently estimate at an
almost optimal rate the number and location of local maxima of an unknown
function on [0, 1]. Both methods impose very few assumptions on the degree of
smoothness of the true underlying function. More recently, Boysen et al.
(2009) study jump-penalized least squares regression problems, where the
underlying function is assumed to be a linear combination of indicator
functions of intervals in [0, 1], and derive consistency rates under different
metrics on functional spaces.

Our work differs from the contributions based on a nonparametric function
estimation framework of, in particular, Mammen and van de Geer (1997) and
Davies and Kovac (2001a) in various aspects, some of which are closely
related to the methodology and scope of Friedman et al. (2007). First and
foremost, we are interested in the asymptotic recovery of the coordinates of
the mean vector $\mu^0$ under the model assumption (1.1), and do not
necessarily view them as n evaluations of an unknown function defined on
[0, 1]. Secondly, we explicitly impose a double asymptotic framework in which
the model complexity and the features of the underlying signal change with n.
This, in particular, allows us to include cases in our analysis where the
number of blocks or the number of local extremes grows unbounded with n, a
feature which typically cannot be directly accommodated in the nonparametric
framework. Nonetheless, we remark that there is a simple reformulation of
our problem as a nonparametric function estimation one. In fact, suppose that
we observe n datapoints of the form

$y_i = n \int_{(i-1)/n}^{i/n} \mu^0(t)\,dt + \varepsilon_i, \qquad i = 1,\ldots,n,$

from an unknown function $\mu^0 : [0, 1] \to \mathbb{R}$. Setting
$\mu_i^0 = n \int_{(i-1)/n}^{i/n} \mu^0(t)\,dt$ would return our original model
[see also Boysen et al. (2009) for a similar model]. Furthermore, for the
analysis of Section 2, we are only concerned with the simultaneous recovery of
both the block partition and of the sparsity pattern of $\mu^0$ and virtually
ignore any other features of the signal. On the one hand, this allows us to
derive rather strong results, namely sparsistency and the oracle inequality of
Theorem 2.7. On the other hand, those results are truly meaningful only when
our modeling assumptions (1.1) of a blocky and sparse signal hold, and our
analysis should not be expected to be robust to misspecification. In
particular, the fused lasso and adaptive fused lasso algorithms should not be
expected to work well, both in practice and in theory, with different kinds of
signals.
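The reformulation of the model through local averages of a function on [0, 1] can be illustrated for a step function, for which the integrals are available in closed form. This is a minimal sketch; the function name `bin_means` and its arguments (`breaks` for the interior breakpoints and `levels` for the values between them) are assumptions made for illustration.

```python
def bin_means(breaks, levels, n):
    """Return mu_i = n * integral of a step function over [(i-1)/n, i/n], i = 1..n.

    breaks: sorted interior breakpoints in (0, 1).
    levels: values of the step function on the len(breaks) + 1 resulting intervals.
    """
    edges = [0.0] + list(breaks) + [1.0]
    mu = []
    for i in range(1, n + 1):
        lo, hi = (i - 1) / n, i / n
        integral = 0.0
        for left, right, level in zip(edges[:-1], edges[1:], levels):
            overlap = min(hi, right) - max(lo, left)
            if overlap > 0:
                integral += level * overlap
        mu.append(n * integral)
    return mu


if __name__ == "__main__":
    # Step function equal to 1 on [0, 0.5) and to 3 on [0.5, 1].
    print(bin_means([0.5], [1.0, 3.0], 4))
    print(bin_means([0.5], [1.0, 3.0], 3))
```

When a breakpoint falls strictly inside a bin, as for n = 3 here, the corresponding coordinate is an average of the two adjacent levels, so the discretized vector can have an extra block of length one.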

2. Properties and refinements of the fused lasso estimator. The crucial
feature of the fused lasso solution (1.2), which makes it ideal for the present
problem, is that it is simultaneously blocky, because of the total variation
penalty $\|\cdot\|_{TV}$, and sparse, because of the $\ell_1$ penalty
$\|\cdot\|_1$. The central goal of this section is to characterize the
asymptotic behavior of the regularization parameters $\lambda_{1,n}$ and
$\lambda_{2,n}$, so that, as $n \to \infty$, the blockiness and sparsity
pattern of the fused lasso estimates match the ones of the unknown signal
$\mu^0$ with overwhelming probability. We first consider the fused lasso
estimator as originally proposed in Friedman et al. (2007) and then a simple
variant, the fused adaptive lasso, which has better asymptotic properties.
For this modified version, we also derive an oracle inequality. We will make
the following simplifying assumption on the errors:

(E) The errors $\varepsilon_i$, $1 \le i \le n$, are identically distributed
centered Gaussian variables with variance $\sigma_n^2$ such that
$\sigma_n \to 0$.

In the typical scenario we have in mind, $\sigma_n = \sigma/\sqrt{n}$.
Assumption (E) is by no means necessary, and it can be easily relaxed to the
case of sub-Gaussian errors.

2.1. The fused lasso solution. Below, we provide an explicit formula for
the fused lasso solution that offers some insight on its properties and
suggests possible improvements. By inspecting (1.2), as both penalty functions
$\|\cdot\|_1$ and $\|\cdot\|_{TV}$ are convex and the objective function is
strictly convex, $\hat\mu^{FL}$ is uniquely determined as the solution to the
subgradient equation

(2.1)  $\hat\mu^{FL} = y - \lambda_{1,n} s_1 - \lambda_{2,n} s_2,$

where $s_1 \in \partial\|\hat\mu^{FL}\|_1$ and
$s_2 \in \partial\|\hat\mu^{FL}\|_{TV}$. For a vector $x \in \mathbb{R}^n$, the
subgradient $\partial\|x\|_1$ is a subset of $\mathbb{R}^n$ consisting of
vectors s such that $s_i = \mathrm{sgn}(x_i)$, where, with some abuse of
notation, we will denote with $\mathrm{sgn}(\cdot)$ the (possibly set-valued)
function on $\mathbb{R}$ given by

$\mathrm{sgn}(x) = \begin{cases} 1, & \text{if } x > 0, \\ -1, & \text{if } x < 0, \\ z, & \text{if } x = 0, \end{cases}$

where z is any number in [-1, 1]. The subgradient $\partial\|x\|_{TV}$ has a
slightly more elaborate form, which is given in Lemma A.1 in the Appendix.

An explicit expression for $\hat\mu^{FL}$ can be obtained in terms of the
fusion estimator

(2.2)  $\hat\mu^{F} = \arg\min_{\mu \in \mathbb{R}^n} \left\{ \sum_{i=1}^n (y_i - \mu_i)^2 + 2\lambda_{2,n}\|\mu\|_{TV} \right\}.$

Notice that, by the same arguments used above, $\hat\mu^F$ is also unique. This
fusion estimator solves a regularized least squares problem with a penalty
on the total variation of the signal and works by fusing together adjacent
coordinates that have similar values to produce a blocky estimate of the form
(1.1). We remark that, in the nonparametric function estimation settings,
one can obtain $\hat\mu^F$ as a piecewise-constant variable-knot spline
function on [0, 1] [see Mammen and van de Geer (1997), Proposition 8] and that
the taut-string algorithm of Davies and Kovac (2001a) solves the constrained
version of (2.2).

For a given solution $\hat\mu^F$ to (2.2), there exists a block partition
$\{\hat B_1, \ldots, \hat B_{\hat J}\}$ and a unique vector
$\hat\nu \in \mathbb{R}^{\hat J}$ such that

(2.3)  $\hat\mu^{F} = \sum_{j=1}^{\hat J} \hat\nu_j 1_{\hat B_j}.$

We take note that both the number $\hat J$ and the elements of the partition
$\{\hat B_1, \ldots, \hat B_{\hat J}\}$ are random quantities, and that, by
construction, no two consecutive entries of $\hat\nu$ are identical. Using
(2.3), the individual entries of the vector $\hat\nu$ can be obtained
explicitly, as shown next.

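Lemma 2.1. Let $\hat\nu \in \mathbb{R}^{\hat J}$ satisfy (2.3) and
$\hat b_j = |\hat B_j|$ for $1 \le j \le \hat J$. Then,

$\hat\nu_j = \frac{1}{\hat b_j} \sum_{i \in \hat B_j} y_i + \hat c_j,$

where

(2.4)  $\hat c_1 = \begin{cases} \lambda_{2,n}/\hat b_1, & \text{if } \hat\nu_2 - \hat\nu_1 > 0, \\ -\lambda_{2,n}/\hat b_1, & \text{if } \hat\nu_2 - \hat\nu_1 < 0, \end{cases}$

(2.5)  $\hat c_{\hat J} = \begin{cases} -\lambda_{2,n}/\hat b_{\hat J}, & \text{if } \hat\nu_{\hat J} - \hat\nu_{\hat J - 1} > 0, \\ \lambda_{2,n}/\hat b_{\hat J}, & \text{if } \hat\nu_{\hat J} - \hat\nu_{\hat J - 1} < 0, \end{cases}$

and, for $1 < j < \hat J$,

(2.6)  $\hat c_j = \begin{cases} 2\lambda_{2,n}/\hat b_j, & \text{if } \hat\nu_{j+1} - \hat\nu_j > 0, \ \hat\nu_j - \hat\nu_{j-1} < 0, \\ -2\lambda_{2,n}/\hat b_j, & \text{if } \hat\nu_{j+1} - \hat\nu_j < 0, \ \hat\nu_j - \hat\nu_{j-1} > 0, \\ 0, & \text{if } \mathrm{sgn}(\hat\nu_j - \hat\nu_{j-1})\,\mathrm{sgn}(\hat\nu_{j+1} - \hat\nu_j) = 1. \end{cases}$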
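The formula of Lemma 2.1 can be checked numerically on small examples whose fusion solution is known by hand. The sketch below is only a consistency check, under the assumption that the shift of each block equals $\lambda_{2,n}$ times the sign of the jump to the right minus the sign of the jump from the left, divided by the block length, which is what stationarity of the objective in (2.2) with respect to a block value gives. The helper name `lemma_block_values` is hypothetical, and the code does not solve (2.2): it takes a candidate solution, reads off its partition and jump signs, and recomputes the block values from the data.

```python
def sign(x):
    """Sign of x as -1, 0 or 1."""
    return (x > 0) - (x < 0)


def lemma_block_values(y, mu_hat, lam2):
    """Block values implied by the block averages of y plus the shift terms."""
    n = len(mu_hat)
    # A new block starts wherever two consecutive entries of mu_hat differ.
    starts = [0] + [i for i in range(1, n) if mu_hat[i] != mu_hat[i - 1]]
    ends = starts[1:] + [n]
    nu = [mu_hat[s] for s in starts]
    values = []
    for j, (s, e) in enumerate(zip(starts, ends)):
        b = e - s
        ybar = sum(y[s:e]) / b
        left = sign(nu[j] - nu[j - 1]) if j > 0 else 0  # jump into block j
        right = sign(nu[j + 1] - nu[j]) if j < len(nu) - 1 else 0  # jump out of it
        values.append(ybar + lam2 * (right - left) / b)
    return values


if __name__ == "__main__":
    # Hand-computed solution of (2.2) for y = (0, 1, 0) and lam2 = 0.1.
    print(lemma_block_values([0, 1, 0], [0.1, 0.8, 0.1], 0.1))
```

In this example the middle block is a local maximum and is pulled down by twice the penalty, the two boundary blocks are pulled up by the penalty once, and the recomputed values coincide with the candidate solution.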



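By Proposition 1 in Friedman et al. (2007), the fused lasso estimator is
obtained by soft-thresholding of the individual coordinates of $\hat\mu^F$, so
that we immediately obtain the next result.

Corollary 2.2. The fused lasso estimator $\hat\mu^{FL}$ is

(2.7)  $\hat\mu_i^{FL} = \begin{cases} \hat\mu_i^F - \lambda_{1,n}, & \hat\mu_i^F \ge \lambda_{1,n}, \\ 0, & |\hat\mu_i^F| < \lambda_{1,n}, \\ \hat\mu_i^F + \lambda_{1,n}, & \hat\mu_i^F \le -\lambda_{1,n}, \end{cases} \qquad i = 1,\ldots,n,$

where $\hat\mu^F$ is the fusion estimator.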
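Soft-thresholding as used in Corollary 2.2 is a one-line operation per coordinate. The following minimal sketch assumes that a fusion estimate is already available (from any solver of (2.2)); the function names `soft_threshold` and `fused_lasso_from_fusion` are illustrative.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam, returning 0 when |x| <= lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def fused_lasso_from_fusion(mu_fusion, lam1):
    """Apply soft-thresholding with level lam1 to every coordinate of a fusion estimate."""
    return [soft_threshold(m, lam1) for m in mu_fusion]


if __name__ == "__main__":
    print(fused_lasso_from_fusion([0.1, 0.8, 0.1, -0.5], lam1=0.2))
```

Because the same level is applied to every coordinate, all surviving coordinates are shifted toward zero by the same amount, regardless of the length of the block they belong to.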



Remarks.

1. As is apparent from Lemma 2.1, the individual blocks found by the fusion
solution $\hat\mu^F$ are each biased by a term whose magnitude depends
directly on the regularization parameter $\lambda_{2,n}$ and, inversely, on
the size of the estimated block itself. That is, the larger the estimated
blocks the smaller the effect of the bias. This term is simply a vertical
shift toward the adjacent blocks: by (2.6), it is negative if the block is a
local maximum, positive if it is a local minimum, and zero otherwise. See
Figure 3. It is worth pointing out that, as expected, the solution obtained
using the taut-string algorithm of Davies and Kovac (2001a) with global
squeezing exhibits exactly the same behavior, with the magnitude of the
vertical shift being controlled by the size of the tube around the integrated
process instead of the penalty term $\lambda_{2,n}$.

2. The regularization parameter $\lambda_{1,n}$ modulates the magnitude of the
sparsity penalty and induces some bias effect as well. However, unlike the
bias determined by the total variation penalty, this second type of bias is of
the same magnitude for all the nonzero coordinates, a fact that can be seen
directly from (2.7). An easy fix, which is considered in Section 2.3, is to
adaptively penalize the estimated blocks differently, depending on their
sizes, with larger blocks penalized less.

2.2. Sparsistency for the fused lasso. In this section, we provide conditions
under which the block partition $\{B_1^0, \ldots, B_{J_0}^0\}$ and the block
sparsity pattern $\mathcal{JS}(\mu^0)$ of $\mu^0$ can be estimated
consistently [see (1.1)] by the fused lasso procedure. We break down our
analysis into two parts, dealing separately with the fusion estimator
$\hat\mu^F$ first, which can be used to recover
$\{B_1^0, \ldots, B_{J_0}^0\}$, and then with the fused lasso solution
$\hat\mu^{FL}$, from which the set $\mathcal{JS}(\mu^0)$ can be estimated. In
Section 2.3, we show how this second task can be accomplished more effectively
by a modified version of the fused lasso estimator.

2.2.1. Recovery of true blocks by fusion only. We first derive sufficient
conditions for the fusion estimator to recover correctly the block partition
of $\mu^0$. Let $\mathcal{J}_0 = \mathcal{J}(\mu^0)$ be the set of jumps of
$\mu^0$ and $J_0 = |\mathcal{J}_0| + 1$ the cardinality of the associated
block partition. Similarly, let $\hat{\mathcal{J}} = \mathcal{J}(\hat\mu^F)$
be the set of jumps for the fusion estimate given in (2.3).

Theorem 2.3. Assume (E) and (1.1). If, for some $\delta > 0$:
1. oo and > + 6), 

2 b^On ^ b° * > + md < ^ - 

CTn a„^log J 4 

where $\alpha_n = \min_{i \in \mathcal{J}_0} |\mu_i^0 - \mu_{i-1}^0|$ and
$b^0_{\min} = \min_{1 \le j \le J_0} b_j^0$, with $b_j^0 = |B_j^0|$, then

(2.8)  $\lim_n P\left(\{\hat{\mathcal{J}} = \mathcal{J}_0\} \cap \{\mathrm{sgn}(\hat\mu_i^F - \hat\mu_{i-1}^F) = \mathrm{sgn}(\mu_i^0 - \mu_{i-1}^0), \ \forall i \in \mathcal{J}_0\}\right) = 1.$

Fig. 3. Different fusion estimates for the data described in Section 2.4. The
dashed line corresponds to the true mean vector, while the three lines
correspond to the fusion estimates with different regularization parameters.
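The two signal-dependent quantities entering Theorem 2.3, the smallest jump $\alpha_n$ and the smallest block length $b^0_{\min}$, are easy to compute for a given mean vector. The sketch below uses a hypothetical helper `jump_and_block_summary` and, as input, the seven-block signal of Section 2.4.

```python
def jump_and_block_summary(mu0):
    """Return the number of blocks, the smallest block length and the smallest jump."""
    n = len(mu0)
    jump_idx = [i for i in range(1, n) if mu0[i] != mu0[i - 1]]
    starts, ends = [0] + jump_idx, jump_idx + [n]
    sizes = [e - s for s, e in zip(starts, ends)]
    # alpha_n is undefined (None here) when the signal has no jumps.
    alpha = min(abs(mu0[i] - mu0[i - 1]) for i in jump_idx) if jump_idx else None
    return {"J0": len(sizes), "b_min": min(sizes), "alpha_n": alpha}


if __name__ == "__main__":
    # Signal of Section 2.4: block lengths and block values.
    layout = [(100, 0.0), (10, 2.0), (100, -0.1), (10, -2.0),
              (100, 0.0), (10, 2.0), (100, 0.1)]
    mu0 = [value for size, value in layout for _ in range(size)]
    print(jump_and_block_summary(mu0))
```

For this signal the smallest jump is about 1.9 and the shortest block has length 10, so the short blocks, not the small jumps, are the limiting factor for recovery by fusion.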



Remarks. 

1. In the proof of Theorem 2.3, instead of Slepian's inequality, one could use 
Markov's inequality and well-known bounds on the supremum of centered 
sub-Gaussian vectors [see, e.g., Lemma 2.3 in Massart (2007)] to derive 
slightly stronger sufficient conditions for (2.8), which however hold for 
the larger class of sub-Gaussian errors. We give the following conditions 
without a proof: 

(a) hm n W2W(mo)| + 2A 2 ,„ = Qj 

(b) lim n ±M= = oo. 

Furthermore, the errors need not be identically distributed. In fact, the
proof of the theorem holds almost unchanged if, for example, one only
assumes that the individual variances are of order $O(1/\sqrt{n})$.

2. Equation (2.8) actually implies not only that $\mathcal{J}_0$ can be
consistently estimated, but also that the true signs of the jumps can be
recovered with overwhelming probability, a feature known in the lasso
literature as sign consistency [see, e.g., Wainwright (2006) and Zhao and Yu
(2006)]. In the present settings, sign consistency of the fusion estimate
implies the following desirable feature of $\hat\mu^F$:






Corollary 2.4. The fusion estimator $\hat\mu^F$ can consistently recover
the local maxima and local minima of $\mu^0$.

3. The magnitude $\alpha_n$ of the smallest jumps of $\mu^0$ is a fundamental
quantity, whose asymptotic behavior determines whether recovery of the true
blocks obtains. In particular, if $\alpha_n$ vanishes at a rate faster than
$\sigma_n/\sqrt{b^0_{\min}}$, then no recovery is possible. In a way, this
guarantees some form of asymptotic distinguishability that prevents adjacent
blocks from looking too similar.

4. The larger the minimal size of a block $b^0_{\min}$, the easier the
recovery of the blocks by fusion.



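2.2.2. Recovery of true blocks and true nonzero coordinates by the fused
lasso. Let $\mathcal{JS}_0 = \mathcal{JS}(\mu^0)$ be the set of nonzero blocks
of $\mu^0$ and $K_0 = |\mathcal{JS}_0|$ its cardinality. Let
$\widehat{\mathcal{JS}} = \mathcal{JS}(\hat\mu^{FL})$ be the equivalent
quantity defined using the fused lasso estimate $\hat\mu^{FL}$. Consider the
event

$\mathcal{R}_{1,n} = \{\widehat{\mathcal{JS}} = \mathcal{JS}_0\} \cap \{\mathrm{sgn}(\hat\nu_j) = \mathrm{sgn}(\nu_j^0), \ \forall j \in \mathcal{JS}_0\}$

that soft-thresholding $\hat\mu^F$ with penalty $\lambda_{1,n}$ will return the
nonzero blocks of $\mu^0$.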



Theorem 2.5. If the conditions of Theorem 2.3 are satisfied and, for
some $\delta > 0$:

1. Al '"^" oo and > 2y/2(l + S); 

2. 2 ,o 2 '" < ^f-, for all n large enough; 

min 



3 P"V min _^ 0O) P "Y mi " > Vl8(l + 5) and Ai n < for all n large enough; 

a„y/logK 



4- 2 ,o 2 '" < , for all n large enough, 



where $\rho_n = \min_{j \in \mathcal{JS}_0} |\nu_j^0|$, then,

$\lim_n P(\mathcal{R}_{1,n}) = 1.$



Remarks. 

1. As was the case for Theorem 2.3, the assumption of Gaussian errors is not
essential and can be relaxed, and, in fact, Remark 1 above still applies.

2. The previous result implies that the fused lasso is not only consistent
but, in fact, sign consistent, so that the signs of the nonzero blocks are
estimated correctly.

3. The magnitude $\rho_n$ of the smallest nonzero block value cannot decrease
to zero too fast, otherwise the sparsity pattern cannot be fully recovered,
just as we pointed out above in Remark 3 for the fusion solution.

4. The conditions of Theorem 2.5 appear to be quite cumbersome for two
main reasons. First, the regularization parameters $\lambda_{1,n}$ and
$\lambda_{2,n}$ interact with each other. As a result, it appears necessary to
impose assumption 2 in order to guarantee that the two different bias terms
they each determine will not disrupt the recovery process. Secondly, one has
to keep track of the size $b^0_{\min}$ of the minimal block. This additional
bookkeeping is due to the fact that the sparsity penalty is enforced globally,
in the sense that all coordinates are penalized in equal amount, thus ignoring
the fact that longer blocks require less regularization (see Remark 1 after
Lemma 2.1).

2.3. The fused adaptive lasso: Sparsistency and an oracle inequality.
Motivated by the stringent nature of the conditions of Theorem 2.5, below we
propose a refinement of the fused lasso estimator, which we call the fused
adaptive lasso. Overall, this slightly different estimator enjoys better
asymptotic properties than the fused lasso, at no additional complexity cost.

The fused adaptive lasso is obtained with the following two-step procedure:

1. Fusion step. Compute the fusion solution $\hat\mu^F$ using the fusion
regularization parameter $\lambda_{2,n}$, as in (2.2), and the corresponding
block partition $(\hat B_1, \ldots, \hat B_{\hat J})$ [see (2.3)]. Obtain

(2.9)  $\hat\mu^{AF} = \sum_{j=1}^{\hat J} \bar y_j 1_{\hat B_j},$

where

$\bar y_j = \frac{1}{\hat b_j} \sum_{i \in \hat B_j} y_i.$

2. Adaptive lasso step. Compute the fused adaptive lasso solution

(2.10)  $\hat\mu^{FAL} = \arg\min_{\mu \in \mathbb{R}^n} \|\hat\mu^{AF} - \mu\|_2^2 + \sum_{i=1}^n \hat\lambda_i |\mu_i|,$

where the n-dimensional random vector $\hat\lambda$ of penalties is

(2.11) A = A^^ 
with $\lambda_{1,n}$ as the $\ell_1$ regularization parameter.

Remarks.

1. The fused adaptive lasso differs from the fused lasso in two fundamental
aspects. First, as easily seen from (2.9), the bias term in the fusion
solution due to the terms $\hat c_j$, which depends on the regularization
parameter $\lambda_{2,n}$, is absent (see Lemma 2.1). Equivalently, the fusion
estimator is only used to estimate the block partition of $\mu^0$, and,
provided this estimate is correct, the block values are estimated unbiasedly
with the sample averages. Using the fusion procedure as an estimator of the
block partition has the other advantage of decoupling the estimation from the
model selection problem, thus freeing, to some extent, the user from the
task of carefully choosing an optimal penalty $\lambda_{2,n}$. In fact,
recovery of the true partition can be obtained even if the problem is
overpenalized, and, therefore, the resulting estimator $\hat\mu^F$ is highly
biased.

Secondly, the penalty terms used for thresholding individual blocks are
rescaled by the square root of the length of the estimated blocks. The
rationale for using this rescaling is very simple. In fact, suppose that,
for some $j_1$, $j_2$, $\hat b_{j_1} \gg \hat b_{j_2}$. Since the variance of
the jth block average $\bar y_j$ is $\sigma_n^2/\hat b_j$, $\bar y_{j_1}$ has a
much smaller standard error than $\bar y_{j_2}$ and, therefore, should be
penalized less heavily. The adequate reduction in the sparsity penalty of
$\bar y_{j_1}$ versus $\bar y_{j_2}$ is precisely the difference in their
standard errors, hence the choice of rescaling by the square root of the block
lengths. The advantage of adaptively thresholding the block values in this
manner is that the procedure will be more effective at identifying longer
nonzero blocks whose values are quite close to 0.

In Section 2.4 we explain both these improvements concretely with a
numerical example.

2. In step 2, the vector $\hat\mu^{FAL}$ is straightforward to compute via
soft-thresholding of the individual coordinates of $\hat\mu^{AF}$, with
coordinate-dependent thresholds determined by the penalties $\hat\lambda_i$,
$1 \le i \le n$.

3. Instead of the soft-thresholded block estimate of step 2, one may consider
the corresponding hard-thresholded estimate, with coordinates
$\hat\mu_i^{AF} 1\{|\hat\mu_i^{AF}| > \hat\lambda_i\}$, $1 \le i \le n$.

One of the asymptotic advantages of the fused adaptive lasso versus the
ordinary fused lasso is that block recovery obtains under milder conditions
than Theorem 2.5, without the need to consider the fusion penalty parameter
$\lambda_{2,n}$ and the length of the minimal block. In some sense, the fused
adaptive lasso can adapt more flexibly to the block sparsity than the fused
lasso.

Proposition 2.6. Assume that the conditions of Theorem 2.3 are satisfied.
Then,

$\lim_n P\{\mathcal{R}_{1,n}\} = 1,$

if, for some $\delta > 0$,

1. ^ -»■ oo and H^S > y/2(l + 8); 

2. — > oo. f" > 2\/2(l + (5) and Ai „ < % for a// n large enough, 

°~ n (T„ylogftr ^ 

where $\rho_n = \min_{j \in \mathcal{JS}_0} |\nu_j^0|$.
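The two-step procedure of Section 2.3 can be sketched as follows, given any fusion estimate. This is an illustration rather than the article's implementation: the function name `fused_adaptive_lasso` is hypothetical, the fusion estimate is taken as an input instead of being computed, and the threshold applied to block j is assumed to be `lam1 / sqrt(b_j)`, which matches the square-root rescaling described in Remark 1 but may differ from the exact constant in (2.11).

```python
import math


def fused_adaptive_lasso(y, mu_fusion, lam1):
    """Block averages on the partition of mu_fusion, then block-wise soft-thresholding.

    ASSUMPTION: block j of length b_j is soft-thresholded at lam1 / sqrt(b_j).
    """
    n = len(y)
    # Step 1: the fusion estimate is used only through its block partition.
    starts = [0] + [i for i in range(1, n) if mu_fusion[i] != mu_fusion[i - 1]]
    ends = starts[1:] + [n]
    estimate = []
    for s, e in zip(starts, ends):
        b = e - s
        ybar = sum(y[s:e]) / b  # unbiased block value, as in (2.9)
        # Step 2: longer blocks receive a smaller threshold.
        threshold = lam1 / math.sqrt(b)
        shrunk = math.copysign(max(abs(ybar) - threshold, 0.0), ybar)
        estimate.extend([shrunk] * b)
    return estimate


if __name__ == "__main__":
    # Noise-free data with a long block of value 0.1 and a short block of value 2.
    y = [0.1] * 100 + [2.0] * 10
    # Any vector with the right partition works here, even a heavily biased one.
    mu_fusion = [0.0] * 100 + [1.0] * 10
    fit = fused_adaptive_lasso(y, mu_fusion, lam1=0.2)
    print(fit[0], fit[-1])
```

With a single global threshold of 0.2, the long block of value 0.1 would be set to zero; with the rescaled threshold 0.2/10 it is retained, which is the behavior the remarks above attribute to the adaptive step.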

A second advantage of the fused adaptive lasso stems from the oracle
property derived below. Consider the ideal situation where we have access
to an oracle who lets us know the $K^0$ sets $B_{j_k}$, $k = 1, \ldots, K^0$,
of the true block partition of $\mu^0$ for which
$|\nu^0_{j_k}| > \sigma_n/\sqrt{b^0_{j_k}}$. Notice that, from this
information, one can recover the true partition. The oracle estimate
$\hat\mu^O$ is the vector with coordinates

$\hat\mu_i^O = \begin{cases} \frac{1}{b^0_{j_k}} \sum_{l \in B_{j_k}} y_l, & \text{if } i \in B_{j_k}, \\ 0, & \text{otherwise}. \end{cases}$

This procedure amounts to setting to 0 the estimates for the coordinates
belonging to the blocks whose true mean value is smaller than
$\sigma_n/\sqrt{b^0_j}$. The corresponding ideal risk is

(2.12)  $E\|\hat\mu^O - \mu^0\|_2^2 = \sum_{j=1}^{J_0} \sum_{i=1}^n 1\{i \in B_j^0\} \min\left\{\frac{\sigma_n^2}{b_j^0}, (\nu_j^0)^2\right\} = K^0 \sigma_n^2 + \sum_{i \notin \bigcup_k B_{j_k}} (\mu_i^0)^2.$

Note, in particular, that

$E\|\hat\mu^O - \mu^0\|_2^2 \le \sum_{i=1}^n \min\{\sigma_n^2, (\mu_i^0)^2\},$

with equality if and only if $b_j^0 = 1$ for all j, where the expression on the
right-hand side is the ideal risk for the oracle estimator based on
thresholding of individual coordinates rather than of blocks. Therefore, if
$\mu^0$ has a block structure, as is assumed here, this different oracle will
be able to achieve a smaller ideal risk.

Before stating our oracle result, we need some additional notation. Recall
that any $\mu \in \mathbb{R}^n$ can always be written as

(2.13)  $\mu = \sum_{j=1}^{J} \nu_j 1_{B_j},$

for some (possibly trivial) block partition $(B_1, \ldots, B_J)$ of
$\{1, \ldots, n\}$, with $J \le n$. Let $\mu^1$ and $\mu^2$ be vectors in
$\mathbb{R}^n$ with block partitions $\{B_1^1, \ldots, B_{J_1}^1\}$ and
$\{B_1^2, \ldots, B_{J_2}^2\}$, respectively, where $J_1, J_2 \le n$. Then,
they satisfy (2.13), for some vectors $\nu^1 \in \mathbb{R}^{J_1}$ and
$\nu^2 \in \mathbb{R}^{J_2}$, respectively. Let $\{L_1, \ldots, L_m\}$ be the
partition of $\{1, \ldots, n\}$ obtained as the refinement of the block
partitions of $\mu^1$ and $\mu^2$, that is, for every $l = 1, \ldots, m$,
$L_l = B^1_{j_1} \cap B^2_{j_2}$, for some $j_1$ and $j_2$. We define the
quantity

$\mathcal{JS}(\mu^1; \mu^2) = \{l : L_l = B^1_{j_1} \cap B^2_{j_2}, \ \nu^1_{j_1} \ne 0\}.$
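The refinement of two block partitions and the resulting index set can be computed by merging the jump locations of the two vectors. The sketch below reads the definition above as the set of refinement cells on which the first vector is nonzero; the function names `jumps` and `refined_nonzero_cells` are hypothetical, and indices are 0-based.

```python
def jumps(mu):
    """Indices i >= 1 (0-based) where mu[i] differs from mu[i - 1]."""
    return [i for i in range(1, len(mu)) if mu[i] != mu[i - 1]]


def refined_nonzero_cells(mu1, mu2):
    """Indices l of the cells of the common refinement on which mu1 is nonzero."""
    # Blocks are runs of consecutive indices, so the refinement is obtained by
    # cutting at every jump of either vector.
    cuts = sorted(set(jumps(mu1)) | set(jumps(mu2)))
    starts = [0] + cuts
    return [l for l, s in enumerate(starts) if mu1[s] != 0]


if __name__ == "__main__":
    mu1 = [1, 1, 1, 0, 0, 0]
    mu2 = [2, 2, 0, 0, 5, 5]
    # Refinement cells: {0, 1}, {2}, {3}, {4, 5}.
    print(refined_nonzero_cells(mu1, mu2), refined_nonzero_cells(mu2, mu1))
```

The set is not symmetric in its two arguments: a single nonzero block of the first vector is counted once for every cell into which the second vector splits it.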



Theorem 2.7. Assume that $\mu^0$ satisfies (1.1) and that



(2.14) 




Let a 2 = — , A2,n = A^a 2 logn, with A > suc/i that A2, n a n < 1/4 and 

Ai jn = 2\Ja^\ogJ , where J is obtained by solving the fusion problem (2.2) 
in the first step of the adaptive fused-lasso procedure. For any vector fj, G IR n , 
set 

V(fj,)=32\JS(»;v )\o- 2 n logJ . 
Then, for any 5 £ [0, 1), 

(2.15) limP j||£ FAL - Molll < ^jgSnti + llM - = 1- 

Remarks. 

1. The assumption in (2.14) stems from Theorem 2.3 and is crucial in our
proof, as it guarantees that recovery of the true block partition of $\mu^0$
by fusion, which is necessary for mimicking the oracle solution $\hat\mu^O$,
is feasible. It essentially allows for consecutive blocks to differ by a
vanishing quantity of smaller order than $\sqrt{\log n/n}$. If the minimal
jump size is bounded away from zero, uniformly in n, then the condition
$\lambda_{2,n}/\alpha_n < 1/4$ is redundant.

2. The proof of Theorem 2.7 shows that $V(\mu)$ is minimized by vectors such
that

$|\mathcal{JS}(\mu; \mu^0)| = |\mathcal{JS}(\mu^0)| = K^0;$

that is, vectors whose block partition matches the true block partition.
Therefore, (2.15) shows that the adaptive fused lasso achieves the same
oracle rates granted by ideal risk (2.12) up to a term that is logarithmic
in $\hat J$.

3. If it is further assumed that $\|\mu^0\|_\infty < C$ uniformly in n, for
some constant C, the result (2.15) can be strengthened to

nn - A^oii! < + Ha 1 - a*°iii> + o(i). 
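The gain of the block oracle behind the ideal risk (2.12) over coordinate-wise thresholding can be quantified for a specific signal. The following sketch evaluates both ideal risks from the block lengths and block values; the function names are illustrative, `sigma` plays the role of $\sigma_n$, and the inputs are those of the signal used in Section 2.4.

```python
def block_oracle_risk(sizes, values, sigma):
    """Ideal risk (2.12): each block contributes b * min(sigma^2 / b, value^2)."""
    return sum(b * min(sigma ** 2 / b, v ** 2) for b, v in zip(sizes, values))


def coordinate_oracle_risk(sizes, values, sigma):
    """Ideal risk when thresholding coordinates: sum over i of min(sigma^2, mu_i^2)."""
    return sum(b * min(sigma ** 2, v ** 2) for b, v in zip(sizes, values))


if __name__ == "__main__":
    sizes = [100, 10, 100, 10, 100, 10, 100]
    values = [0.0, 2.0, -0.1, -2.0, 0.0, 2.0, 0.1]
    print(block_oracle_risk(sizes, values, sigma=0.2))
    print(coordinate_oracle_risk(sizes, values, sigma=0.2))
```

For these inputs the block oracle pays $\sigma^2 = 0.04$ once for each of the five nonzero blocks, for a total of about 0.2, whereas the coordinate-wise oracle pays about 3.2, mostly because of the two long blocks whose values are smaller than $\sigma$ in magnitude.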

2.4. A toy example. We discuss a stylized numerical example for the
purpose of clearly illustrating the two advantages of the fused adaptive lasso,
namely the use of the fusion penalty only for recovering the true block
partition and the block-dependent rescaling of the lasso penalty. See Remark
1 before Proposition 2.6 for details.

We simulate one sample according to the model

$y_i = \mu_i^0 + \varepsilon_i,$

where

$\mu_i^0 = \begin{cases} 0, & 1 \le i \le 100, \\ 2, & 101 \le i \le 110, \\ -0.1, & 111 \le i \le 210, \\ -2, & 211 \le i \le 220, \\ 0, & 221 \le i \le 320, \\ 2, & 321 \le i \le 330, \\ 0.1, & 331 \le i \le 430, \end{cases}$

and the errors are independent Gaussian variables with mean zero and standard
deviation $\sigma = 0.2$. Figure 1 shows the data along with the true signal.
Notice that some of the coordinates of $\mu^0$ are in absolute value less than
$\sigma$, a fact that, as we will see, if $\mu^0$ were not blocky, would make
the recovery of those coordinates infeasible. Figure 3 portrays the simulated
data and three fusion estimates $\hat\mu^F$, each of them solving (2.2) for
three different values of $\lambda_{2,n}$: 4.8, 6.8 and 7.8. The dashed line
corresponds to the true mean vector $\mu^0$. The excessive amount of
penalization is apparent from the large bias in all these estimates,
especially in the smaller blocks. Nonetheless, the block partitions that each
of these estimates produce match, in fact, very closely the true block
partition.
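The simulated data of this example can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names `toy_signal` and `simulate` and the choice of seed are assumptions, so the draws will not match the sample shown in the figures.

```python
import random


def toy_signal():
    """Mean vector of Section 2.4: seven blocks, 430 coordinates in total."""
    layout = [(100, 0.0), (10, 2.0), (100, -0.1), (10, -2.0),
              (100, 0.0), (10, 2.0), (100, 0.1)]
    mu0 = []
    for size, value in layout:
        mu0.extend([value] * size)
    return mu0


def simulate(mu0, sigma=0.2, seed=0):
    """Add independent centered Gaussian errors with standard deviation sigma."""
    rng = random.Random(seed)
    return [m + rng.gauss(0.0, sigma) for m in mu0]


if __name__ == "__main__":
    mu0 = toy_signal()
    y = simulate(mu0, sigma=0.2, seed=1)
    print(len(y), mu0[100], mu0[110], mu0[429])
```

Fixing the seed makes a run reproducible, and setting `sigma=0` returns the signal itself, which is convenient when testing an estimator on noise-free input.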

Figure 4 shows the modified fusion estimate $\hat\mu^{AF}$ given in (2.9) using
the fusion estimate from Figure 3 with the largest amount of bias, along with
the true mean vector $\mu^0$, displayed as a dashed line. Because the block
partition was estimated correctly, the estimate $\hat\mu^{AF}$ is almost
indistinguishable from the true vector $\mu^0$. For this particular dataset,
the adaptive lasso step would set to zero correctly the first and fifth
blocks, but not the third and seventh blocks, which in Figure 4 are enclosed
by black vertical lines. In fact, although the true value of those blocks is
in magnitude half the standard deviation of the errors, $\sigma$, the standard
error for both the block estimates is roughly $\sigma/10$. This is taken into
account in the adaptive lasso step, but not in the lasso step, where even the
ideal soft threshold, that is $\sigma$, would be too high, thus incorrectly
setting to zero both of these blocks.

Fig. 4. The modified fusion estimate $\hat\mu^{AF}$ of (2.9), using the fusion
estimate from Figure 3 with the lowest total variation. The dashed gray line,
which is almost indistinguishable from the estimate, is the true signal
$\mu^0$. The vertical lines enclose the third and seventh blocks, whose value
is in magnitude half the standard deviation of the errors.

Finally, we simulated 1000 datasets according to the model described here 
and computed the empirical mean squared errors for the fused adaptive lasso 
estimates, using for the penalty terms the values indicated in Theorem 2.7. 
Figure 5 shows the histogram of the empirical mean squared errors, with the 
vertical line representing the true mean squared error 
$\frac{1}{n}\mathbb{E}\|y - \mu^0\|_2^2$, namely $\sigma^2$. Notice how the 
empirical mean squared errors are larger than the true value, the usual 
price paid for adaptivity. 

Fig. 5. Distributions of the empirical mean squared errors from 1000 simulations from 
the model described in Section 2.4 using the fused adaptive lasso with penalty parameters 
chosen according to Theorem 2.7. The vertical line represents $\sigma^2$. 



2.5. How to choose $\lambda_{1,n}$ and $\lambda_{2,n}$. From the practical 
standpoint, the choice of the regularization parameters is crucial. For the 
fused adaptive lasso, one can infer from the proof of Theorem 2.7 that the 
optimal choice for the vector of lasso penalty terms $\lambda$ is given by 

$$\lambda = \sum_{j=1}^{\hat J} \frac{2\sigma_n\sqrt{\log \hat J}}{\sqrt{\hat b_j}}\, 1_{\hat B_j},$$

with $1_{\hat B_j}$ denoting the indicator vector of the estimated block 
$\hat B_j$, of size $\hat b_j$, $1 \le j \le \hat J$. This choice 
corresponds to soft-thresholding $\hat J$ independent Gaussian variables 
with variances $\sigma_n^2/\hat b_j$, $j = 1, \ldots, \hat J$. 
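
The block-dependent soft-thresholding described above can be sketched as follows. The sketch assumes that the block partition is already available (the argument `starts`, a hypothetical list of 0-based block start indices) and reads the threshold for a block of size b as 2σ√(log J)/√b, which is our interpretation of the choice stated above; the function names are illustrative only.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator sgn(x) * (|x| - t)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def adaptive_block_threshold(y, starts, sigma):
    """Soft-threshold each block average with a block-dependent threshold.

    starts: 0-based start index of every block (the first entry is 0).
    The threshold 2*sigma*sqrt(log J)/sqrt(b) is an assumed reading of the
    penalty choice in the text, with b the block size and J the block count.
    """
    y = np.asarray(y, dtype=float)
    edges = list(starts) + [y.size]
    J = len(starts)
    est = np.empty(y.size)
    for a, b in zip(edges[:-1], edges[1:]):
        thr = 2.0 * sigma * np.sqrt(np.log(J)) / np.sqrt(b - a)
        est[a:b] = soft_threshold(y[a:b].mean(), thr)
    return est

# Illustration on the noise-free toy signal of Section 2.4.
mu0 = np.repeat([0.0, 2.0, -0.1, -2.0, 0.0, 2.0, 0.1],
                [100, 10, 100, 10, 100, 10, 100])
starts = [0, 100, 110, 210, 220, 320, 330]
est = adaptive_block_threshold(mu0, starts, sigma=0.2)
```

With these values the threshold for the long blocks is about 0.056, so the blocks with values ±0.1 are shrunk but not set to zero, whereas a single threshold equal to σ = 0.2 would remove them.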

Admittedly, for the total variation penalty term $\lambda_{2,n}$ the 
theoretical results presented here, being of asymptotic nature, may not 
directly lead to procedures that are effective in practice, unless n is 
very large. Choosing optimal values for the penalty parameters remains an 
important open problem in much of the penalized least-squares literature, 
where the theoretical (e.g., asymptotic) results may offer little guidance 
in practice. Cross validation is certainly a viable way of choosing both 
$\lambda_{1,n}$ and $\lambda_{2,n}$, as recommended in Friedman et al. 
(2007), and as is almost exclusively done in practice (although it remains 
to be seen whether this procedure leads to optimal estimators). 
Nonetheless, an automatic procedure for choosing $\lambda_{2,n}$ that 
exhibits reasonable performance still eludes us. However, our theoretical 
analysis and the toy example presented above show that, if the signal is 
comprised mostly of long blocks, a large value of $\lambda_{2,n}$ will lead 
to accurate estimates of the block partition, and the results should be 
relatively robust to different choices. 

An interesting possibility suggested by a referee, which is beyond the scope 
of this article, is to replace the overall total variation parameter 
$\lambda_{2,n}$ with a series of data-driven parameters, one for each term 
of the total variation seminorm. Specifically, one can consider the 
penalized problem 



where $\{\lambda_{2,i}, i = 2, \ldots, n\}$ are possibly different 
coefficients that modulate the effect of the total variation penalty at 
different locations along the signal, so that the solution is more robust 
to spurious local extrema due to unusually large errors. In fact, as 
pointed out by Davies and Kovac (2001b), the taut-string algorithm with 
local squeezing approximates the solution to this problem. Although local 
squeezing increases the complexity of the algorithm, it has been shown to 
enjoy a better performance than the problem with an omnibus total variation 
penalty. The choice of the regularization parameters 
$\{\lambda_{2,i}, i = 2, \ldots, n\}$ can be done iteratively, starting 
with all $\lambda_{2,i}$'s being identical and very large (thus producing 
an estimate with constant entries) and then, at every step, shrinking them 
differently based on the features of the residuals, such as the 
multiresolution coefficients as defined in Davies and Kovac (2001a). 
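
As a small illustration of a location-dependent total variation penalty, the sketch below evaluates the sum of λ_{2,i}|μ_i − μ_{i−1}| over i = 2, ..., n for a given vector of coefficients. It is only meant to make the penalty term concrete: the function name is hypothetical, the full objective and the iterative choice of the coefficients are not implemented, and a scalar coefficient recovers the usual omnibus penalty.

```python
import numpy as np

def weighted_tv(mu, lam):
    """Sum over i = 2..n of lam_i * |mu_i - mu_{i-1}|.

    lam is either a scalar (omnibus penalty) or a vector of length n - 1
    holding one coefficient per term of the total variation seminorm.
    """
    mu = np.asarray(mu, dtype=float)
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (mu.size - 1,))
    return float(np.sum(lam * np.abs(np.diff(mu))))

mu = [0.0, 1.0, 3.0]
omnibus = weighted_tv(mu, 2.0)          # same coefficient for every jump
local = weighted_tv(mu, [2.0, 0.5])     # the second jump is penalized less
```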

3. Sieve methods. In this section, we study the rates of convergence for 
the sieve least squares solutions (1.3) and (1.4). For convenience, 
consistency is measured with respect to the normalized Euclidean norm 
$\|x\|_n = \frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^n x_i^2}$. Accordingly, we 
change our assumption on the errors as follows: 

(E') The errors $(\varepsilon_1, \ldots, \varepsilon_n)$ are independent sub-Gaussian variables with 
variances bounded by $\sigma^2$, uniformly in n. 

Notice that the results and settings of previous sections can be adapted in 
a straightforward way to the present framework. 

We first study the estimator given in (1.3). To that end, consider the class 
of vectors 

$$\mathcal{C}_{TV}(T_n) = \{\mu \in \mathbb{R}^n : \|\mu\|_{TV} \le T_n,\ \|\mu\|_\infty \le C\},$$

where C is a finite constant that does not depend on n, and the 
$\ell_1$-ball of radius $L_n$ 

$$\mathcal{C}_{\ell_1}(L_n) = \{\mu \in \mathbb{R}^n : \|\mu\|_1 \le L_n\},$$

with both numbers $T_n$ and $L_n$ being allowed to grow unboundedly with n. 
Then, we can rewrite (1.3) as 

$$\hat\mu^{TL} = \arg\min_{\mu \in \mathcal{C}_{TV}(T_n) \cap \mathcal{C}_{\ell_1}(L_n)} \|y - \mu\|_2^2.$$

Below, we derive the consistency rate for $\hat\mu^{TL}$ in terms of the 
sequences $T_n$ and $L_n$ by dealing separately with the two sieves. 



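Theorem 3.1. Assume (E') and $\mu^0 \in \mathcal{C}_{\ell_1}(L_n) \cap \mathcal{C}_{TV}(T_n)$. Let 

(3.1) $$\hat\mu^T = \arg\min_{\mu \in \mathcal{C}_{TV}(T_n)} \|y - \mu\|_2^2$$

and 

$$\hat\mu^L = \arg\min_{\mu \in \mathcal{C}_{\ell_1}(L_n)} \|y - \mu\|_2^2.$$

Then, 

$$\|\hat\mu^T - \mu^0\|_n = O_P(T_n^{1/3} n^{-1/3}),$$

so that $\hat\mu^T$ is consistent provided that $T_n = o(n)$, and 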



(3.2) \\?-A\« = P L - i " (1 ° S " )3/2 



so that $\hat\mu^L$ is consistent provided that 

$$L_n = o\left(\frac{n}{(\log n)^{3/2}}\right).$$

As a result, 



(3.3) \\jF L -fj?\\ n = P ( nK & J A 1 



n \ n 

Remarks. 

1. It appears that the requirement for the vectors in $\mathcal{C}_{TV}(T_n)$ to be 
uniformly bounded cannot be relaxed without negatively affecting the rate 
of consistency or without introducing additional assumptions [see, e.g., 
Theorem 9.2 in van de Geer (2000)]. 

2. The rate of consistency for $\hat\mu^T$ should be compared with the analogous 
rate derived in Theorem 9 of Mammen and van de Geer (1997) for the 
penalized version of the least squares problem (3.1). 

3. The rate given in (3.2) is not the sharpest possible. In fact, an application 
of Theorem 5 of Donoho and Johnstone (1994) yields for $\hat\mu^L$ the improved 
minimax rate 



V n 



for the case of i.i.d. Gaussian errors, from which we can infer a maximal 
rate of growth $L_n = o\left(\frac{n}{\sqrt{\log n}}\right)$. 



4. We make no claims that the rate given in equation (3.3), which is just 
the minimum of the rates for two separate sieve least squares problems, is 
sharp. Better rates may be obtained from better estimates of the metric 
entropy of the set $\mathcal{C}_{\ell_1}(L_n) \cap \mathcal{C}_{TV}(T_n)$. 

5. On the relationship between $L_n$ and $T_n$. The total variation and 
$\ell_1$ constraints are not independent of each other. One can easily 
verify that 

$$T_n^{\max} = \max_{x \in \mathcal{C}_{\ell_1}(L_n)} \|x\|_{TV} = 2L_n.$$

On the other hand, every vector $x \in \mathbb{R}^n$ such that 
$\|x\|_{TV} = T_n$ can be written as 

$$x = m + t,$$

where $\|t\|_{TV} = T_n$, $m = 1_n \bar{x}_n$, with 
$\bar{x}_n = \frac{1}{n}\sum_i x_i$, and $\sum_i t_i = 0$. Notice that $m$ 
can be estimated at the rate $\frac{1}{\sqrt{n}}$, so the convergence rates 
for $\hat\mu^T$ depend on how well $t$ can be estimated. Next, notice that 



rmax — 11 + 11 

Li = max Ml 



T n n 



x&C TV (T n ),x=m+t 2 U— 1' 

where $m + t$ is the decomposition of $x$ discussed above. Therefore, over 
the set $\mathcal{C}_{TV}(T_n) \cap \mathcal{C}_{\ell_1}(L_n)$, we obtain the relationship 



(3.4) T n ~ 2L n 



max 
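
The identity for the maximal total variation over the ℓ1-ball stated in Remark 5 can be checked numerically. The sketch below is an illustration under arbitrary choices of n, radius and seed (none of which come from the paper): it samples points on the boundary of the ℓ1-ball, records the largest total variation found, and evaluates the total variation of a single interior spike of height L, which attains the value 2L.

```python
import numpy as np

def tv(x):
    """Total variation seminorm: sum of |x_i - x_{i-1}|."""
    return float(np.sum(np.abs(np.diff(x))))

def check_tv_bound(n=50, L=3.0, trials=1000, seed=0):
    """Largest TV among random points with l1 norm L, and TV of a spike."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x = rng.normal(size=n)
        x *= L / np.sum(np.abs(x))      # rescale so that ||x||_1 = L
        worst = max(worst, tv(x))
    spike = np.zeros(n)
    spike[n // 2] = L                   # all the mass on one interior coordinate
    return worst, tv(spike)

worst, attained = check_tv_bound()
```

The bound follows from |x_i − x_{i−1}| ≤ |x_i| + |x_{i−1}|, in which every coordinate appears at most twice; the spike shows that the bound is attained when n ≥ 3.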



Our final result concerns the estimator resulting from the nonconvex sieve 
least squares problem (1.4). Define the set 

$$\mathcal{C}(S_n, J_n) = \{\mu \in \mathbb{R}^n : |\mathcal{S}(\mu)| \le S_n\} \cap \{\mu \in \mathbb{R}^n : |\mathcal{J}(\mu)| + 1 \le J_n\},$$

consisting of vectors in $\mathbb{R}^n$ that have at most $S_n$ nonzero 
coordinates and take on at most $J_n$ different values. For convenience, we 
further impose the following, fairly weak assumption, which does not 
preclude the coordinates of $\mu^0$ from becoming increasingly large in 
magnitude with n: 

(R) the set $\mathcal{C}(S_n, J_n)$ is contained in an $S_n$-dimensional cube centered at the 
origin with volume $R_n$ such that 

$$\log R_n = o(n).$$



Theorem 3.2. Assume (E') and (R) and let 
$\hat\mu^{SJ} = \arg\min_{\mu \in \mathcal{C}(S_n, J_n)} \|y - \mu\|_2^2$. 

L VSn = °(v&i), then 
(3.5) ||/2 5J -/i ||„ = O P | 

2. When S n = n, (3.5) still holds, provided J n = o(j^j). 
Remarks. 

1. The rate on is in accordance with the persistence rate derived in 
Greenshtein (2006), Theorem 1, for related least squares regression prob- 
lems on sieves. 

2. If Jo is bounded, uniformly in n, the consistency rate we obtain is para- 
metric. See Boysen et al. (2009) for a similar result. 



4. Discussion and future directions. In this work, we tackle the task of 
estimating a blocky and sparse signal using three different methodologies, 
whose asymptotic properties we investigate. We study the fused lasso 
estimator proposed in Friedman et al. (2007) and a simple variant of it, 
with better properties. For both procedures, we provide conditions under 
which they recover the block partition with overwhelming probability as n 
gets larger. We also study consistency rates of sieve least squares 
problems under two types of constraints, one on the maximal radii of the 
$\ell_1$- and $\|\cdot\|_{TV}$-balls, and the other on the maximal number 
of blocks and nonzero coordinates. Overall, these results complement each 
other in providing different types of asymptotic information for the task 
at hand and complement other analyses already existing in the statistical 
literature. 

There are a number of generalizations of the results presented. We mention 
only the ones that seem the most natural to us. A first extension involves 
considering a corrupted version of a signal 
$\mu^0 \in \mathbb{R}^n \times \mathbb{R}^n$, corresponding to the problem 
of denoising a sparse, blocky image over an $n \times n$ grid, for which 
total variation methods have proven quite effective. Another interesting 
direction would be to assume a known slowly-varying variance function, for 
example, with given Lipschitz constant, and incorporate this information 
directly into the penalty functions in both the fusion and adaptive lasso 
steps. Furthermore, under this heteroscedastic scenario, one could first 
build a consistent estimator of the variance function and then, in the 
fusion step, use it to penalize the individual jumps adaptively. We think 
that our techniques and results can be directly generalized to study these 
more complex settings. Finally, we believe it would be quite valuable to 
investigate the possibility of building confidence balls and, in 
particular, confidence bands for the entire signal or for some of its local 
maxima or minima based on the estimators considered here. 



APPENDIX: PROOFS 

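Lemma A.1. Let $\|\cdot\|_{TV} : \mathbb{R}^k \to \mathbb{R}$ be the fused 
penalty $\|x\|_{TV} = \sum_{i=2}^k |x_i - x_{i-1}|$. Then, 
$\|\cdot\|_{TV}$ is convex and, for any $x \in \mathbb{R}^k$, the 
subdifferential $\partial\|x\|_{TV}$ is the set of all vectors 
$s \in \mathbb{R}^k$ such that 

(A.1) $$s_i = \begin{cases} -w_2, & \text{if } i = 1, \\ w_i - w_{i+1}, & \text{if } 1 < i < k, \\ w_k, & \text{if } i = k, \end{cases}$$

where $w_i = \mathrm{sgn}(x_i - x_{i-1})$, for $2 \le i \le k$. 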

Proof. Let $L$ be a $(k-1) \times k$ matrix with entries $L_{i,i} = -1$ 
and $L_{i,i+1} = 1$ for $1 \le i \le k-1$ and 0 otherwise. Then, for any 
$x \in \mathbb{R}^k$, $\|x\|_{TV} = \|Lx\|_1$. Convexity of 
$\|\cdot\|_{TV}$ follows from the fact that it is the composition of a 
linear map with the $\ell_1$ norm, which is convex. Next, by the definition 
of the subdifferential of the $\ell_1$ norm, for any $y \in \mathbb{R}^k$, 

(A.2) $$\|Ly\|_1 \ge \|Lx\|_1 + \langle L(y - x), w \rangle$$

holds if and only if $w \in W_x \subset \mathbb{R}^{k-1}$, where $W_x$ is 
the set of all vectors $w$ such that $w_i = \mathrm{sgn}((Lx)_i)$. 
Equation (A.2) is equivalent to 

$$\|y\|_{TV} \ge \|x\|_{TV} + \langle y - x, s \rangle$$

for each $k$-dimensional vector $s$ such that $s = L^\top w$ for some 
$w \in W_x$. This set is described by (A.1) and is, therefore, 
$\partial\|x\|_{TV}$. □ 
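
The construction used in this proof can be checked numerically. The sketch below builds the difference matrix, forms one element s = Lᵀw of the subdifferential, and verifies the subgradient inequality at random points. It assumes NumPy; the function names and the test vector are illustrative, and sgn(0) is taken to be 0, which is one admissible value in [−1, 1].

```python
import numpy as np

def difference_matrix(k):
    """(k-1) x k matrix with L[i, i] = -1 and L[i, i+1] = 1."""
    L = np.zeros((k - 1, k))
    idx = np.arange(k - 1)
    L[idx, idx] = -1.0
    L[idx, idx + 1] = 1.0
    return L

def tv(x):
    """Total variation seminorm, computed as the l1 norm of L x."""
    return float(np.sum(np.abs(difference_matrix(len(x)) @ x)))

def tv_subgradient(x):
    """One subgradient of the TV seminorm at x: s = L^T sgn(L x)."""
    L = difference_matrix(len(x))
    w = np.sign(L @ x)          # np.sign(0) = 0, a valid choice in [-1, 1]
    return L.T @ w

x = np.array([1.0, 3.0, 3.0, 0.5])
s = tv_subgradient(x)

# Subgradient inequality ||y||_TV >= ||x||_TV + <y - x, s> at random points.
rng = np.random.default_rng(0)
gaps = [tv(y) - tv(x) - (y - x) @ s for y in rng.normal(size=(200, 4))]
```

For this x the differences are (2, 0, −2.5), so w = (1, 0, −1) and s = (−1, 1, 1, −1), in agreement with (A.1).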

Proof of Lemma 2.1. From the subgradient condition (2.1) with $\lambda_{1,n} = 0$, 
we obtain 

3 k is - , 3 ieBj 3 a Bj 

Using (A.l), a simple telescoping argument leads to 

!2, if (Vj+i - dj) > 0, (Vj - Vj-i) < 0, 

-2, if (9 j+1 - Vj) < 0, (Vj - Vj_i) > 0, 
0, if (Vj - %_i)(% + i - Vj) = 1, 
where $i_j = \min\{i : i \in B_j\}$. This gives (2.6). It remains to 
consider the cases $j = 1$ and $j = J$. If $j = 1$, 
$\sum_{i \in B_1} s_i = -w_{i_2}$, and if $j = J$, 
$\sum_{i \in B_J} s_i = w_{i_J}$, from which (2.4) and (2.5) follow, 
respectively. □ 

Proof of Theorem 2.3. Let 

(A.3) Kx 2tn = {J = Jo} n {sgn(/if - Jlf-i) = aga(p% - Vi G Jo} 

and, for 2 < i < n, let d® = [i® — fjtf_ 1 , d{ = fif — and df = E{ — £$_!. 
Using the subgradient conditions (A.l), the event 7^a 2 n occurs if and only 
if 

d\ = A 2 ,n(2sgn(d°) - sgn(c?j_i) - sgn(dj+i)) Mi Jo, 
where, for x = 0, sgn(x) is the set [—1, 1], and 

\di\>0 ViG Jo- 
Next, in virtue of Lemma 2.1, on 1Z\ 2 n we can write 

* = fcB- E yfc + c ?W-io— E yk- c %-D 

J'W fceB° i(i-i) fees ,. 

= d i > + ^0 _ E £fc ~^0 E £ k + C °j(i) ~ C i(i-l)> 

iW fees° (i) i(i-i) keB° (i _ 1} 

where the index identifies the block to which i belongs; that is, 

is the block such that i G Bj^ for all i = 1, . . . , n. Accordingly, = |£>^| 

and Cy/^ denotes the bias term in the fusion estimate as given in Lemma 

2.1, with bj and Vj replaced by b® and v®, respectively, for j = 1, . . . , J$. 
As a result, the event 1Z\ 2 n occurs in probability if both 

(A.4) max | (if | < A 2 ,„|2sgn((i°) - sgn(di_i) -sgn(d m )| < 4A 2 , n 
iiJo 



and 



(A. 5) min 



1 1 

d! i + Jfl~ E £k ~lfi E £ k + c j(i) ~ C )(i-l) 

iW fcei3° fees ,. 



J) 



>0 



hold with probability tending to 1 as $n \to \infty$. 

We first consider (A.4). Notice that, for each 2 < i ^ j < n, Etif = 0, 
Var df = 2ul and 



Cov(df,rff) 



otherwise. 
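For $2 \le i \le n$, let $d_i^* \sim N(0, 2\sigma^2)$ be independent, so that 

$$\mathbb{E}(d_i^\varepsilon d_j^\varepsilon) \le \mathbb{E}(d_i^* d_j^*), \quad \text{for all } 2 \le i \ne j \le n,$$
$$\mathbb{E}(d_i^\varepsilon)^2 = \mathbb{E}(d_i^*)^2, \quad \text{for all } 2 \le i \le n.$$

Then, by Slepian's inequality [see, e.g., Ledoux and Talagrand (1991)], 

$$\mathbb{P}\Big(\max_{i \notin \mathcal{J}_0} |d_i^\varepsilon| > 4\lambda_{2,n}\Big) \le \mathbb{P}\Big(\max_{i \notin \mathcal{J}_0} |d_i^*| > 4\lambda_{2,n}\Big).$$

By Chernoff's bound for standard Gaussian variables, followed by the 
union bound, 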

f A 2 1 
p{maxK| >4A 2 ,„} < 2exp -8^f + log\J£\ , 

which vanishes if condition 1 is satisfied. 

In order to verify (A.5), it is sufficient to show that, with probability 
tending to 1 as $n \to \infty$, 



max 



£ k~W— £ k+ C °j(i 



iM fcefi ,., J'C*- 1 ) fcei? ,- 



where $\alpha_n = \min_{i \in \mathcal{J}_0} |d_i^0|$. By the triangle inequality, it is enough to show that 

<a n /2 



(A. 6) max 

ieJo 



b o ^2 £k uo X! £k 



and 

(A.7) max| C ° w - Cj Vi)l<«n/2. 

The previous inequality is implied by the last inequality in condition 2 in 
virtue of the bound 

Next, we turn to (A.6). Set Aj = -J- Y,keB° £ k ~ w 1 — EfceS £ fc> witn 

i) j(i-i) 

i £ J°. Then, EAj = for all i and 

2 

maxVar Aj < 2^_. 

2 

Therefore, letting A* ~ A(0, 2^^), i 6 be independent, we obtain, using 

min 

standard Gaussian tail bounds, 

b° ~ 2 



I v I ^ a '< 
max A,- > — 



]<p{^|X-|>f}<2exp{-^ + log|J 
Under condition 2, the above probability vanishes. This, combined with 
(A.7), shows that (A.5) holds with probability tending to 1 if condition 2 is 
verified. □ 



Proof of Theorem 2.5. It is enough to show that the event 

occurs in probability for $n \to \infty$. Because the conditions of 
Theorem 2.3 are assumed, 
$\lim_n \mathbb{P}\{\mathcal{R}_{\lambda_{2,n}}\} = 1$, which implies that 
we can restrict our analysis to the set $\mathcal{R}_{\lambda_{2,n}}$, 
where $\hat J = J_0$ and $\hat B_j = B_j^0$, for $1 \le j \le J_0$. Next, 
from Corollary 2.2, it is immediately verified that the fused-lasso 
solution is 



-FL 



3=1 



where vj = sgn(Pj)(i/j = Ai in )+ is the soft-thresholded version of 9j. There- 
fore, in order to verify the claim, one needs to consider the simpler lasso 
problem applied to the vector V. Inspecting the sub-gradient condition for 
this problem, and by arguments similar to the ones used above, it follows 
that lim n ¥(TZ\ 1 n ) = 1 obtains provided both 



(Ai 



max 



h o £i + c i 



< Ai 



and 
(A.9) 



max 



h o 2^ £i + c i 



Al, 



< Pn 



hold with probability tending to 1 as n — > oo, where the quantities Cj are 
given in Lemma 2.1. Letting Xj = -njJ2 



e 



i, notice that Xj ~ iV(0, and 



that (X\, . . . , Xj Q ) are independent. Then, a combination of the Chernoff's 
and the union bounds yields 



max 



1 

J ieB" 



> 



A 



l,n 



< V] exp 



< 



exp 



A? „&2. 



l,n"min 

8a 2 , 



+ log|J5g 



and 



max 



> ^ 



< exp 



18o-2 J 



< exp 



r n mm 

' 18a2 



+ log|.75 c 
which give large deviations bounds for the error sums in (A.8) and (A.9). 
Conditions 1 and 3 guarantee that the above probabilities vanish for $n \to \infty$. 
Thus, with the additional conditions 2 and 4, the inequalities (A.8) and (A.9) 
are verified in probability. □ 



Proof of Proposition 2.6. The proof is virtually identical to the 
proof of Theorem 2.5, the main differences stemming from the facts that the 
bias terms $c_j = 0$ for all $1 \le j \le J_0$ and 



1 



V ei ~N(0,a 2 n ) 



We omit the details. □ 



Proof of Theorem 2.7. Let fi F be the fusion estimate using the 
penalty \2, n - Then, because of assumption (2.14), and with the specific 
choice of A2, n and a 2 given in the statement, it can be verified that the 
conditions of Theorem 2.3 are met. Thus, the event 

$$\mathcal{T} = \{\hat J = J_0\} \cap \{\hat B_j = B_j^0,\ 1 \le j \le J_0\}$$

has probability arbitrarily close to 1, for all n large enough. On this event 
we next investigate the adaptive fused-lasso $\hat\mu$. Because $\hat\mu$ is the minimizer 
of (2.10), for any $\mu \in \mathbb{R}^n$, 

lli" AF -/i|li + 2^Ai|/2 i | < 11/2^-^111 + 2^^1^1, 

i i 

where /2 AF and A are given in (2.9) and (2.11), respectively. Adding and 
subtracting fiQ inside both terms \\fi. AF — an d ||/2 AF — Hli yields 

(A. 10) ||jE2-^o||! < 11^- Will! + 2X^(1^1 ~ lM*D + 2 ( £ *^-A i ), 

i 

where, on e* = /2 AF - fj, = E/=i Xjl B o , with Xj ~ N(0, ^) and (X 1: ..., 
Xjo) independent. Next, consider the sub-event A C T given by 

•A = {|e*| < Aj, for each % = 1, . .. ,n} 

= {\Xj\ < \i, n /\Jtfj, for each j = 1, . . . , J }. 

Then, 

P(^) = p{max|0| < Ai, n }, 
where $(\zeta_1, \ldots, \zeta_{J_0})$ are i.i.d. $N(0, \sigma_n^2)$. Notice that because of the choice of 
$\lambda_{1,n}$, $\lim_n \mathbb{P}(\mathcal{A}) = 1$ by standard large deviation bounds for Gaussians (see 
also the proof of Theorem 2.3). Next, on $\mathcal{A}$, we have 

(A.ll) 2<e*,/2-/i) <2 E Ai|/2i-^|+2 E Ai|&|. 

The decomposition 

2 5Z^i(lA*i| _ \V-i\) = 2 X! ^iMjl -2 E E MfiiV 

i ies(n) ieS(n) f£S{n) 

along with (A.ll) and the triangle inequality, yields, on A, 

2^2 Aj(|/ij| - + 2{e*,p, - fi) < 4 E - m\. 

The previous display and (A. 10) lead to the inequality 
(A. 12) ||m — A*o||| <[!>"• — A*o||l H- 4 E Wfii-fi>i\ 

valid on A. Next, it is easy to see that 
and, in particular, 

E \ ? = A?|J5( M °)| ) 

ieS(/x) 

if and only if JcS(^) = JS^ ). 

Therefore, by the Cauchy-Schwarz inequality, the second term on the 
right-hand side of (A. 12) can be bounded on A as follows: 



4 E Ail/ij <4Ai >n y|J r 5(^;/i°)| ll/i-zi^. 
Then, using the triangle inequality, (A. 12) becomes 



1 1 A* — A*o 1 1 2 < Mm — Mo II 2 + 4 Ai in -y/| l 75(/z;/i°)|(|| / u- /x || 2 + \\fM - fJ>h)- 

On $\mathcal{A}$, the same arguments used in the second part of the proof of Lemma 3.7 
in van de Geer (2007) establish the inequality in the claim. Since $\lim_n \mathbb{P}(\mathcal{A}) = 
1$, the first result follows. □ 

Proof of Theorem 3.1. Let $N(\delta, \mathcal{F}_n, \|\cdot\|_n)$ denote 
the $\delta$-covering number of the set 
$\mathcal{F}_n \subset \mathbb{R}^n$ with respect to the norm 
$\|\cdot\|_n$ and notice that, for any $C > 0$, 

$$N(\delta, C\mathcal{F}_n, \|\cdot\|_n) = N\Big(\frac{\delta}{C}, \mathcal{F}_n, \|\cdot\|_n\Big).$$

Furthermore, observe that $\mathcal{C}_{TV}(T_n) = T_n\,\mathcal{C}_{TV}(1)$. 
By a theorem of Birman and Solomjak (1967) [see, e.g., Lorentz, Golitschek 
and Makovoz (1996), Theorem 6.1], the $\delta$-metric entropy of 
$\mathcal{C}_{TV}(T_n)$ with respect to the $L_2(P_n)$ norm is 

T 



for some constant C independent of n. Letting ^{5) = J yCy = \/T n C5, 
the solution to 

gives 



5n> 



T l/3 
-L n. 



n 



where the symbol $\gtrsim$ indicates inequality up to a universal constant. The 
result now follows from Theorem 3.4.1 of van der Vaart and Wellner (1996) 
(see also the discussion on pages 331 and 332 of the same reference). In order 
to establish (3.2), we use Lemma 4.3 in Loubes and van de Geer (2002) to 
get that the metric entropy of $\mathcal{C}_{\ell_1}(L_n)$ is 

H(5, C tl (L„), || • || n ) < flog n + log 



n5 2 V \fnb ) 

for some constant C independent of n. Notice that the entropy integral of 
$H(\delta, \mathcal{C}_{\ell_1}(L_n), \|\cdot\|_n)$ diverges on any 
neighborhood of 0. By Theorem 9.1 in van de Geer (2000), the rate of 
consistency $\delta_n$ for $\hat\mu^L$ with respect to the norm 
$\|\cdot\|_n$ is given by the solution to 

(A.13) V^<£ >*(*»), 

where 



> I*" jH(x,C £l (L n ))dx 
J AS?. v 



with A a constant independent of n. Equation (A.13) is satisfied for a se- 
quence 5 n satisfying 



Vn5 n > = — log l/6 n , 

\/n 



which gives the rate (3.2). □ 
Proof of Theorem 3.2. Let $H(\delta_n, \mathcal{C}(S_n, J_n), \|\cdot\|_n)$ denote the metric 
entropy of $\mathcal{C}(S_n, J_n)$ with respect to the norm $\|\cdot\|_n$. By Lemma A.2 and 
assumption (C2), for $\delta_n < 1$, the equation 



> J ^logH(x,C(S n ,J n ), 



I dx 




leads to 



because S n = o(jj^) and j n < s n . The sequence 5 n = J ^ satisfies the con- 
ditions of Theorem 3.4.1 of van der Vaart and Wellner (1996), thus proving 
(3.5). The second claim in the theorem is proved similarly, where the 
left-hand side of (A.14) in Lemma A.2 is now bounded by $C_{1,n}$ only. □ 



Lemma A.2. For the distance induced by the norm $\|x\|_n = \frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^n x_i^2}$, 
the metric entropy of $\mathcal{C}(S_n, J_n)$ satisfies 

(A.14) $$H(\delta, \mathcal{C}(S_n, J_n), \|\cdot\|_n) \le C_{1,n} + C_{2,n},$$



where 



Cl, n = ^ log R n + J n {log ^ + l - log S n 



and 



$$C_{2,n} = \log S_n + S_n \log n.$$

Proof. For fixed $\delta > 0$, we will construct a $\delta$-grid of 
$\mathcal{C}(S_n, J_n)$ based on the Euclidean distance. For every choice 
of $S_n$ nonzero entries of $\mu$, we regard $\mu$ as a vector in 
$\mathbb{R}^{S_n}$ which is block-wise constant with $J_n$ blocks. 

Then, there exist $J_n$ positive integer numbers $d_1, \ldots, d_{J_n}$ 
such that $\sum_l d_l = S_n$ and one can think of $\mu$ as the 
concatenation of $J_n$ vectors $\mu_1, \ldots, \mu_{J_n}$ each having 
constant entries, where $\mu_l \in \mathbb{R}^{d_l}$, $l = 1, \ldots, J_n$. 
Each $\mu_l$ can be any point along the main diagonal of the 
$d_l$-dimensional cube centered at 0 with edge length $R_n^{1/S_n}$ and 
volume $R_n^{d_l/S_n}$. The length of the main diagonal of each such cube 
is $R_n^{1/S_n}\sqrt{d_l}$. Therefore, for any specific choice of $S_n$ 
nonzero coordinates, the slice in the corresponding $S_n$-dimensional cube 
centered at 0 and with edge length $R_n^{1/S_n}$ consisting of the set of 
vectors in $B_n$ with discontinuity profile $(d_1, \ldots, d_{J_n})$ is 
the set 

1=1 
where $\mathcal{L}(R, d_l)$ denotes the closed line segment in $\mathbb{R}^{S_n}$ between the points 
7Trf ; (li?) and 7r^(— 1R), where 1 is the 5 n -dimensional vector with coor- 
dinates all equal to 1 and tt^ the function from M. Sn onto M. Sn given by 
^di 0*0 = y with yi = for i < Y,j=\ d i - 1 or % > Ylj=i di and yi = Xi oth- 

i / a i / c 

erwise. Notice that the length of each £(R n n ,di) is precisely R n "v$. 
If J n = S n , lZ n is the 5 n -dimensional cube centered at with volume R n , 
while if J n < k n the set lZ n is a hyper-rectangle (not full dimensional) which 
can be embedded as a hyper-rectangle in W Jn centered at and with edge 
lengths equal to the lengths of £(Rn kn ,di), for I = 1, . . . , J n . As a result, it 
is immediate to see that the volume of lZ n can be calculated as 

Y[R}l s "Vdi = Rt /Sn X[JJi. 

i i 

Next, partition each of the J n perpendicular sides of lZ n into intervals 
of length 5^1^, l = l,...,J n . This gives a partition of 7Z n into smaller 

hyper-rectangle of edge lengths &\p§-, for 1 = 1,..., J n . Every point in 1Z n 
is within Euclidean distance 5 from the center of one of the small hyper- 
rectangles, which therefore form an 5-grid for lZ n . By a volume comparison, 
the cardinality of such a grid is 

R J n n/Sn UlVdi _( Rl/Sn ^ Jn 



UiSVdi/s; 

For fixed $S_n$, the number of distinct block patterns with cardinality at 
most $J_n$ is equal to the number of nonnegative solutions to 
$d_1 + d_2 + \cdots + d_{J_n} = S_n$, which can be bounded as 



j J < (S n + Jn ~ 1) 



[see, e.g., Stanley (2000)]. Thus, the logarithm of cardinality of this 5-grid 
is 

(A.15) j- log R n + Jn (log ~ § + X ~ log S r ^j + J n \og(S n + J n ~ 1) . 

Next, the number of subsets of $\{1, \ldots, n\}$ of size at most $S_n$ is 

$$\sum_{i=1}^{S_n} \binom{n}{i}.$$

Thus, the logarithm of the cardinality for a $\delta$-grid over $B_n$ is bounded by 
(A.15) plus the quantity 

$$\log S_n + S_n \log n.$$

The result for the || • || n norms now follows by replacing 5 with 5/y/n in 
(A.15). □ 